Due to the complex nature of a pandemic such as COVID-19, forecasting
how it will behave is difficult, yet it is of the utmost necessity.
Furthermore, adapting predictive models to data sets obtained from
different countries and areas is necessary, as it provides a wider view of
the global pandemic situation and more information on how models can be
improved. Therefore, we combine here the long short-term memory (LSTM)
model and the traditional susceptible-infected-recovered-deceased (SIRD)
model for the COVID-19 prediction task in Ho Chi Minh City, Vietnam. In
particular, LSTM shows its strength in processing and making accurate
numerical predictions on a large set of historical input. Following the SIRD
model, the whole population is divided into four states (S), (I), (R), and (D),
and the changes from one state to another are governed by a parameter set.
By assessing the numerical output and the corresponding parameter set, we
could reveal more insights about the root causes of the changes. The
predictive model updates every 10 days to produce an output that is closest
to reality. In general, such a combination delivers transparent, accurate, and
up-to-date predictions for human experts, which is important for research on
COVID-19.

Keywords: COVID-19, Data visual, LSTM network, Real-time forecasting, SIRD model

1. INTRODUCTION

Since the first outbreak in Wuhan, Hubei, China (Nov. 2019), COVID-19, the disease caused by SARS-CoV-2, has
rapidly spread over the world, with total infected cases and total deaths of about 490 million and
6.15 million, respectively (Mar. 2022) [1]. Indeed, COVID-19 was so contagious that it was declared a
"pandemic" (Mar. 2020) [2] shortly after the WHO officially announced it a "Public Health Emergency of
International Concern" (Jan. 2020) [3]. As Vietnam shares a border with China, COVID-19 rapidly became
the most important public concern for Vietnamese society, government, and academia in early 2020 [4].
Studies in this early stage served mainly as reports of infected cases [5] and of the reaction of society as
reflected in the risk assessment of the pandemic [6]. Despite its early successes in containing the virus and
maintaining economic growth [7], [8] with a strict response, Vietnam was and has been severely affected by
the disruption of global supply chains, which caused a crisis and stagnation in the national economy
(including hospitality and tourism).

Among the most popular models used for predicting infectious disease outbreaks is the
susceptible-infectious-recovered-deceased (SIRD) model [9]. As its name suggests, the model includes four
states: susceptible (S), people at risk of infection; infected (I), people who are infected; recovered (R),
people who have recovered from the disease; and deceased (D), people who have passed away because of the
disease. It should be noted that in a conventional SIRD model, infected cases are assumed to gain immunity
after recovery, which avoids the computational complication of considering re-infected cases [10], [11].
Aside from the SIRD model, the COVID-19 pandemic has been studied by means of the logistic model [12],
confinement and quarantine models [13], and Bayesian and stochastic techniques [14], [15]. However,
predicting pandemics with mathematical models, which have been mainly developed from [16], could be
ambiguous because it is hard to avoid bias in the selection of the model parameters.

In recent years, self-learning algorithms categorized as deep learning (DL), a subbranch of machine
learning (ML), have been developed for various tasks such as natural language processing, face and object
detection, and speech recognition [17]. DL shows its strength in processing big data with relatively high
accuracy in comparison to traditional algorithms and can be fine-tuned to obtain even better results for
particular applications. For instance, DL has been applied in various fields such as the hospitality industry
[18], the legal field [19], stock market prediction [20], and traffic anomaly detection [21]. Furthermore, DL is a
promising approach for processing the daily records of COVID-19 infected cases, which increase massively
amidst the pandemic [22], [23]. The infected cases over the course of time are fed into DL models such as the
recurrent neural network (RNN) [24] and long short-term memory (LSTM) [25], which are specialized for
processing sequential historical data. As a result, predictions based on historical inputs from these DL models
achieve remarkably high accuracy. However, the well-packed layer structures of the DL models often appear
as black boxes, which provide no insight into how the predictions were obtained. Therefore, analysts
face difficulties in interpreting the results to identify the root causes and take appropriate counteractions. In
view of this, interpretable machine learning has been developed with the aim of offering analysts explanations
for the results while maintaining computational power [26].

To mitigate the impact of COVID-19, prediction of infected cases, recoveries, and deaths is
important, so that policy makers can prepare preventive measures for the worst scenarios in advance. This
paper aims to fulfill this forecasting task by utilizing the LSTM DL model in combination with the SIRD model.
Over the course of the problem formulation, we show how the two models are adopted and
combined to yield accurate predictions with helpful insights, which can facilitate the making of public health
policies.


2. METHOD 

This study serves as a forecast for the COVID-19 situation in Ho Chi Minh City, Vietnam. First,
sequential historical data are input into the LSTM so that the main features of the data can be extracted. Second,
the predicted outputs from the LSTM are considered as optimal parameters for the SIRD model, which is then used
to predict the COVID-19 cases. The challenges that we have to overcome are: processing a massive
amount of historical data, a complicated computational model, long computational time, and making the
predicted results interpretable while ensuring that they best reflect the real pandemic situation. These goals
can be achieved using the approach illustrated in Figure 1.

Figure 1. Proposed approach


2.1. Predictive modelling-SIRD model 

The SIRD model was used to model the COVID-19 spread [26], [27]. The model starts with (S),
where people are healthy. A population of (S) moves to (I) when they are infected with COVID-19. From (I),
they can move either to (R) (people are cured) or to (D) (people pass away). As
mentioned above, the transition from (I) to (R) is irreversible because it is assumed that infected people
gain immunity to the disease after recovering from it. Besides, we assume that the three parameters of the
SIRD model are (β, γ, μ), which are respectively the infection rate, the recovery rate, and the death rate
per time unit. The total population is denoted as N = S + I + R + D. Then, the SIRD model
consists of a system of four first-order differential equations, starting from the initial time
(t) at which the epidemic was recognized, as (1.1) to (1.4).


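\begin{align}
\frac{dS(t)}{dt} &= -\beta \frac{S(t)\, I(t)}{N}, \tag{1.1} \\
\frac{dI(t)}{dt} &= \beta \frac{S(t)\, I(t)}{N} - \gamma I(t) - \mu I(t), \tag{1.2} \\
\frac{dR(t)}{dt} &= \gamma I(t), \tag{1.3} \\
\frac{dD(t)}{dt} &= \mu I(t). \tag{1.4}
\end{align}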
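
As an illustration of the system (1.1) to (1.4), the following minimal Python sketch evaluates the four derivatives for given head counts. The function name `sird_rhs` and the numerical values in the usage note are hypothetical and serve only to show how the three rates move people between the states; this is not the implementation used in this study.

```python
def sird_rhs(s, i, r, d, beta, gamma, mu):
    """Right-hand side of the SIRD system for absolute head counts.

    beta, gamma, mu are the infection, recovery, and death rates per time unit.
    """
    n = s + i + r + d                   # total population N = S + I + R + D
    new_infections = beta * s * i / n   # flow from (S) to (I)
    recoveries = gamma * i              # flow from (I) to (R)
    deaths = mu * i                     # flow from (I) to (D)

    ds = -new_infections
    di = new_infections - recoveries - deaths
    dr = recoveries
    dd = deaths
    return ds, di, dr, dd
```

For example, `sird_rhs(990, 10, 0, 0, 0.3, 0.1, 0.01)` describes a hypothetical population of 1000 with 10 infected people. The four derivatives always sum to zero, which reflects that N stays constant in the model.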


At a particular time (t) and for specified (β, γ, μ), we are able to obtain a unique set of
S(t), I(t), R(t), D(t), which are the numbers of people in each state, respectively [27]. Indeed, estimation of
(β, γ, μ) is complicated and sometimes biased [28], but it is essential for the accuracy of the pandemic
prediction with the SIRD model. The three parameters (β, γ, μ) can be extracted from the historical data, and by
assessing the key parameters β, γ, and μ, we can gain some insights into the pandemic. It should be
noted that the time range is not given because each pandemic differs from the others.


\[ S' = -\beta \times \frac{S\, I}{N}. \tag{2} \]


Subsequently, the variation of recovered cases per time unit and of infected cases per time unit can be derived
from (1.3) and (1.2) as (3) and (4).

\[ R' = \gamma \times I. \tag{3} \]

\[ I' = \beta \times \frac{S\, I}{N} - \gamma \times I - \mu \times I. \tag{4} \]

Eventually, we can obtain the variation of deceased cases per time unit from (1.4) as (5).

\[ D' = \mu \times I. \tag{5} \]


Assessing these variations together with the key parameters can help us gain insights into the
pandemic situation. For example, if D(t) tends to increase while μ is high and β is small, it can be
concluded that the new deceased cases do not stem from the newly infected but from serious health issues of
the existing infected cases. In other words, the contagion of COVID-19 has been brought under some control.
Additionally, there may be some problems with the healthcare system, which would be the reason for the increasing
number of deaths. This would encourage governments to propose appropriate and up-to-date policies, such as
lifting COVID-19 restrictions and reviewing healthcare policies. By interpreting the obtained results in this
manner, the SIRD model is no longer inexplicable.

Remarkably, obtaining a single accurate set of parameters (β, γ, μ) is inessential because the set varies from
case to case, i.e., between regions with different geography, economy, social policies, and capacity of healthcare systems.
Because the parameters vary over time, they can best describe the real situation only if the estimation
is limited to a particular region and to a short, particular time. Therefore, processing a massive amount of
COVID-19 data from all around the world is nontrivial and indeed impossible, because one cannot predict all
potential changes that would happen to the COVID-19 situation in the world. At the start of the pandemic, when
the public is not aware of the disease, β describes the natural infection rate [28] in society. Over the course
of time, β changes as people perform hygiene practices and social distancing in line with the health
regulations enforced by the government. On the other hand, γ reflects the number of recoveries within a time
range, and its inverse, γ⁻¹, is the number of days an individual is infected. Thus, we can estimate γ from the
average recovery time of individuals in the medical records. Together with μ, which shows the rate of
deceased cases within a time range, γ indicates how well the healthcare system of the region under
consideration performs [29]. As explained, to describe the dynamics of an unknown disease that has the
potential to become a pandemic, estimation of the parameters (β, γ, μ) is the most important task, which is


Deep learning application for real-time prediction of COVID-19 outbreak with ... (Hoang-Sy Nguyen) 


570 m) ISSN: 2502-4752 


later presented in detail. According to [30], healthcare restrictions are the most effective way to control the
spread of the pandemic and to decrease the infected as well as the deceased cases.
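
The relation between the recovery rate and the average recovery time described above can be sketched in a few lines of Python. The function name `estimate_gamma` and the list of recovery durations are hypothetical; real medical records would first need cleaning, which is not shown here.

```python
def estimate_gamma(recovery_days):
    """Estimate the recovery rate as the inverse of the mean recovery time.

    recovery_days: durations (in days) from infection to recovery,
    one entry per recovered individual (hypothetical input format).
    """
    mean_days = sum(recovery_days) / len(recovery_days)
    return 1.0 / mean_days
```

With three hypothetical patients who recovered after 10, 14, and 12 days, the mean recovery time is 12 days, so the sketch returns a recovery rate of 1/12 per day.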

To consider this dependence of the parameters, we employ ρ₀, as studied in [31], to
represent the number of infected cases caused by one infected case (the reproduction number) as (6):


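\[
\rho_0 = \frac{\beta}{\gamma + \mu} \cdot \frac{S}{N},
\qquad
\begin{cases}
\text{(i)}, & \text{if } \rho_0 < 1 \\
\text{(ii)}, & \text{otherwise}
\end{cases}
\tag{6}
\]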


in which (i) indicates that there is a reduction in the newly infected cases and a tendency of the pandemic to
be moderated, whereas (ii) indicates the exponential and uncontrollable growth of the pandemic. The formula
emphasizes the importance of social distancing, since for a given β at a particular time of the pandemic, ρ₀ can
only be decreased if we reduce the number of people that are potentially infected. The interrelation between
the key parameters inspired us to employ the DL model for a more accurate estimation of the three
parameters (β, γ, μ) from the historical data. In addition, utilizing the DL approach in this manner also makes
the predictive DL model more transparent for medical experts.
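
The two regimes can be illustrated with a short Python sketch. It assumes that the reproduction number takes the form β/(γ + μ) × S/N, which is the threshold implied by (1.2) (the infected cases grow only when βS/N exceeds γ + μ); the function names and the numerical values are hypothetical.

```python
def reproduction_number(beta, gamma, mu, s, n):
    """Assumed form: rho_0 = beta / (gamma + mu) * S / N."""
    return beta / (gamma + mu) * s / n


def regime(rho):
    """Case (i): the outbreak declines; case (ii): it keeps growing."""
    return "declining" if rho < 1 else "growing"
```

For fixed rates β = 0.3, γ = 0.1, and μ = 0.05, reducing the susceptible share S/N from 0.9 to 0.4 lowers the value from about 1.8 to about 0.8, which moves the outbreak from regime (ii) to regime (i). This is the effect that social distancing aims for.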

Finally, we propose a hybrid model architecture that combines the above with the idea of employing
interpretable machine learning techniques to automatically find the best approximated parameters, as
presented in Figure 2. The steps are discussed in detail as follows.

Step 1: extracting the key features from the historical data. Given the historical data and the population N
as the input for the LSTM layers, we first produce a collection of feature vectors ΔS_k, ΔI_k, ΔR_k, ΔD_k,
respectively, which describe the concentration of COVID-19 on a set of local contaminated surfaces, as (7) to (10).


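\begin{align}
\Delta S_k &= \frac{S(k+1)}{N} - \frac{S(k)}{N}, \tag{7} \\
\Delta I_k &= \frac{I(k+1)}{N} - \frac{I(k)}{N}, \tag{8} \\
\Delta R_k &= \frac{R(k+1)}{N} - \frac{R(k)}{N}, \tag{9} \\
\Delta D_k &= \frac{D(k+1)}{N} - \frac{D(k)}{N}. \tag{10}
\end{align}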
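
The feature extraction in (7) to (10) amounts to taking normalized day-to-day differences of the four series. A minimal sketch in plain Python is given below; the function names and the toy series are hypothetical, and the LSTM layers that consume these features are not shown.

```python
def daily_differences(series, n):
    """Normalized change x(k+1)/N - x(k)/N for every consecutive pair of days."""
    return [(series[k + 1] - series[k]) / n for k in range(len(series) - 1)]


def sird_features(s, i, r, d, n):
    """One (dS, dI, dR, dD) tuple per day k, built from the four daily series."""
    return list(zip(daily_differences(s, n),
                    daily_differences(i, n),
                    daily_differences(r, n),
                    daily_differences(d, n)))
```

Since N = S + I + R + D is constant, the four components of each feature tuple sum to zero, which offers a quick sanity check on the input data.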


It should be noted that indirect infection is considered, but only at a random time every day for each 
individual case, because the indirect infection rate is relatively small. 

Step 2: encoding and decoding for SIRD training. The output of the LSTM is flattened to a 1D vector so
that it can be encoded with the variational autoencoder (VAE) [32]. One of the most typical encoding
techniques, namely the multilayer perceptron, is employed [33]. The decoder is designed to generate the output,
which is a matrix of size n × 4, the same as the input. Since the output matrix records the data from the
2nd to the (n + 1)th day of the set of historical data, it enables automatic labeling for the encoding-decoding
process. The Euler method [34] is utilized to decode the matrix output according to the parameters (β, γ, μ)
yielded by the previous learning process. Specifically, based on the k-th day, the decoder estimates the
number of cases in the (S), (I), (R), (D) states on the (k + 1)-th day as S(k + 1), I(k + 1), R(k + 1),
D(k + 1), respectively:


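\begin{align}
\frac{S(k+1)}{N} &= \frac{S(k)}{N} - \beta \frac{S(k)}{N} \frac{I(k)}{N}, \tag{11} \\
\frac{I(k+1)}{N} &= \frac{I(k)}{N} + \beta \frac{S(k)}{N} \frac{I(k)}{N} - \gamma \frac{I(k)}{N} - \mu \frac{I(k)}{N}, \tag{12} \\
\frac{R(k+1)}{N} &= \frac{R(k)}{N} + \gamma \frac{I(k)}{N}, \tag{13} \\
\frac{D(k+1)}{N} &= \frac{D(k)}{N} + \mu \frac{I(k)}{N}. \tag{14}
\end{align}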
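
The Euler update in (11) to (14) can be written as a short Python sketch that works directly on the normalized fractions S/N, I/N, R/N, D/N with a step of one day. The function names `euler_step` and `decode` are hypothetical; the sketch only illustrates the parameter-free decoding step, not the trained network that supplies (β, γ, μ).

```python
def euler_step(state, beta, gamma, mu):
    """Advance the normalized (S, I, R, D) fractions by one day."""
    s, i, r, d = state
    infected_flow = beta * s * i        # share moving from (S) to (I)
    s_next = s - infected_flow
    i_next = i + infected_flow - gamma * i - mu * i
    r_next = r + gamma * i
    d_next = d + mu * i
    return (s_next, i_next, r_next, d_next)


def decode(initial_state, beta, gamma, mu, days):
    """Roll the Euler step forward and return one 4-tuple per predicted day."""
    states = [initial_state]
    for _ in range(days):
        states.append(euler_step(states[-1], beta, gamma, mu))
    return states[1:]                   # rows for day 2 up to day (days + 1)
```

Calling `decode` with n days yields n rows of four values, which matches the n × 4 output matrix described above. Because the update contains no trainable weights, only the estimated parameters change between training iterations.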


Additionally, during training, after the data are fed through the SIRD layers, we employ the mean
squared error (MSE) as the loss function to determine the error of our prediction [35]. Besides, for
such a historical data set with a lot of noise, the Adam optimizer is appropriate for the weight-updating task [36].
Subsequently, we can calculate the MSE of a given epoch for the loss function as (15):

where the two W terms are the k-th vectors from the historical data and from the prediction, respectively,
considering the output matrix of size n × 4 discussed previously. Remark: by means of the loss function, it is
possible to update the system's weights following the backpropagation method. It is important to mention that
the decoder is not updated throughout the training process because it has no trainable weights. The training
process works so that when the results of the loss function are stable, the prediction of (β, γ, μ) from any
historical data is declared successful.
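
For orientation, a mean squared error over two n × 4 matrices can be computed as in the following sketch. Averaging over all n × 4 entries is an assumption made here for illustration; the exact normalization used in this study may differ, and the function name `mse` is hypothetical.

```python
def mse(actual, predicted):
    """Mean squared error over all entries of two equally sized matrices.

    actual, predicted: lists of rows, each row holding the four
    (S, I, R, D) values of one day.
    """
    total = 0.0
    count = 0
    for row_actual, row_predicted in zip(actual, predicted):
        for a, p in zip(row_actual, row_predicted):
            total += (a - p) ** 2
            count += 1
    return total / count
```

A value that stops decreasing from one epoch to the next corresponds to the stable loss mentioned above.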
Figure 2. Proposed combined SIRD-LSTM model


2.2. Data source and visualization 

We investigate in this study two sets of data, i.e., from the COVID-19 updates of the Vietnamese
Ministry of Health (MOH) [37] and from the Coronavirus Resource Center of Johns Hopkins University [38] (with
the data set in [39]). Figure 3 illustrates the total infectious (T) cases in Ho Chi Minh City (including the infected,
recovered, and deceased cases) from Jul. 2021 to Feb. 2022. The Python programming language was used to
collect and process the data. The historical data input is updated once every 10 consecutive days. The
predictive model requires inputs that are classified into the (S)-(I)-(R)-(D) states. However, from the available
sources, only (R), (D), and (T) (which includes (I), (R), and (D)) can be extracted. Thus, we need to consider the
whole population of Ho Chi Minh City to calculate the other inputs. The classified data are C_S = N − T,
C_I = T − R − D, C_R = R, and C_D = D, and these inputs are then normalized by dividing them by the population
of Ho Chi Minh City to obtain the percentages of S-I-R-D cases, respectively. The total population is N, and the
other parameters are as mentioned above.
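
The classification and normalization described above can be sketched as follows. The function name `to_sird_fractions` and the counts in the usage note are hypothetical and are not taken from the MOH or Johns Hopkins data.

```python
def to_sird_fractions(total_cases, recovered, deceased, population):
    """Split the reported counts into normalized (S, I, R, D) shares.

    total_cases counts every person ever infected, i.e. (I) + (R) + (D).
    """
    susceptible = population - total_cases
    infected = total_cases - recovered - deceased
    counts = (susceptible, infected, recovered, deceased)
    return tuple(x / population for x in counts)
```

For a hypothetical region of 10000 people with 1000 total cases, 700 recoveries, and 50 deaths, the sketch returns shares of 0.9, 0.025, 0.07, and 0.005, which sum to one.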
Figure 3. Total infectious cases in Ho Chi Minh City (Jul. 2021 to Feb. 2022)


3. RESULTS AND DISCUSSION 

The model works so that data from the previous days are combined with the parameters (β, γ, μ) to predict
the S, I, R, D cases in the upcoming days. Notably, the key features selected from the input data are
important for the estimation of the parameters (β, γ, μ) and thus must be chosen carefully. The goal of the proposed
hybrid predictive model is to accurately approximate the parameters (β, γ, μ), based on which future cases can
be calculated (with the lowest error in comparison with reality) and their characteristics can be inferred. Another
factor that affects the prediction is the day lag n, which is in the range of 5 to 31 in [40]. We utilize the
recursive strategy [41] for k-step forecasting to determine the optimal number n of lag days for the proposed
model. As for the accuracy assessment, the coefficient of determination R² between the predicted and the real
cases is calculated as one minus the ratio of the residual sum of squares to the total sum of squares.
R² is at most 1, and the closer it is to 1 (that is, the smaller the residual error), the better the SIRD
model explains the data.
Subsequently, we calculate the mean absolute percentage error (MAPE) for the k-step forecasting as the mean of
the absolute differences between the actual and the predicted cases, each divided by the actual cases. MAPE is
non-negative, and a smaller MAPE indicates a better result.
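
Both metrics can be computed with a few lines of Python. The sketch below follows the standard definitions of R² and MAPE; the function names and the toy numbers in the usage note are hypothetical.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


def mape(actual, predicted):
    """Mean absolute percentage error, expressed as a fraction (not in %)."""
    errors = [abs(a - p) / abs(a) for a, p in zip(actual, predicted)]
    return sum(errors) / len(errors)
```

For actual cases [100, 200, 300] and predictions [110, 190, 300], the sketch gives R² = 0.99 and MAPE = 0.05. Note that R² becomes negative when a model fits worse than simply predicting the mean of the actual cases, which is how negative entries in a comparison table should be read.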

The historical data obtained from the time range between Jul. 2021 and Feb. 2022 were used for
training the DL model. The prediction is realized in a way that 28 days in the past are used to forecast the
next 28 days. The figures below illustrate the actual historical data (from late Jan. 2022 to late Feb. 2022) and
the predicted daily cases (from late Feb. 2022 to late Mar. 2022). The predictions were realized for three
scenarios (normal, best, and worst) that could happen with the COVID-19 situation in Ho Chi Minh City,
Vietnam. The trends over the predicted days reflect different insights, with regard to different parameter
values (β, γ, μ). In the case of a prediction made by black-box models, for a high number of deceased cases, the
most common conclusion would be that the current health system could not handle the rising number of
infectious cases. Regarding the same prediction made with the SIRD model, if we assess the parameters (β, γ, μ)
and see that the high deceased number is accompanied by an increase of the recovery rate and a
decrease of the transmission rate as well as the death rate, we could conclude that the pandemic situation has been
under control in the area under study. Other reasoning can be made when we assess the rise or fall of cases
in combination with the parameters (β, γ, μ).
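
The recursive strategy for multi-step forecasting mentioned above feeds each one-step prediction back into the input window. The following sketch shows the idea with a placeholder one-step model; `recursive_forecast` and `linear_trend` are hypothetical names, and the trained SIRD-LSTM model would take the place of the placeholder.

```python
def recursive_forecast(history, one_step_model, lag, steps):
    """Forecast `steps` days ahead by reusing predictions as new inputs."""
    window = list(history[-lag:])            # the last `lag` observed days
    forecasts = []
    for _ in range(steps):
        next_value = one_step_model(window)  # predict one day ahead
        forecasts.append(next_value)
        window = window[1:] + [next_value]   # slide the window forward
    return forecasts


def linear_trend(window):
    """Placeholder model: extend the trend of the last two days."""
    return 2 * window[-1] - window[-2]
```

With `lag=28` and `steps=28`, this corresponds to using 28 past days to forecast the next 28 days. Errors can accumulate over the horizon because later steps rely on earlier predictions.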

Figure 4 to Figure 7 present the number of predicted cases for the three different scenarios.
All the predicted scenarios share one thing in common, i.e., the number of cases stabilizes over
time. In Figure 4, based on the recent historical data set, the new daily cases are predicted to decrease over time
for all three scenarios. This is a positive sign, which indicates that the healthcare measures taken so far by
the government in Ho Chi Minh City are effective in the fight against the pandemic. Accompanying this, in
Figure 5, how well the healthcare system has been navigated is also shown by the increase in daily recovered
cases in the next days for all three scenarios. In Figure 6, the numbers of deceased cases in the best and normal
scenarios are predicted to be relatively small. This is reasonable because at this stage of the pandemic
situation, a majority of the population has gained immunity either by vaccination or by recovering from a previous
infection. Moreover, it could also be because the new COVID-19 variant has been less severe. As can be
observed in Figure 7, in the worst scenario, the deceased cases grow exponentially to a high number. Even
though this may appear to be unrealistic, it can still warn the government about how severe the situation
could be, so that appropriate plans can be prepared in advance to contain the pandemic in the worst case.
Figure 5. Forecast of new daily recovered cases
Figure 6. Forecast of new daily deceased cases (normal case and best case)
Figure 7. Forecast of new daily deceased cases (worst case)


In general, considering the presented results and the insights they provide, we can proactively cope
with the pandemic changes. For a fixed training data set and a stable pandemic situation, the SIRD model can
forecast the situation well. Otherwise, in the case of sudden changes in the pandemic, more time is needed
for the model to predict accurately. To overcome this, the SIRD model is often updated with the help of
reinforcement learning to stay up to date with reality. We recommend updating the historical data input on
a roughly weekly basis for the best prediction.

The impact of COVID-19 on economics and politics has
been surprisingly severe and long lasting. According to a report from the World Health Organization (WHO), from
Dec. 2019 up to Mar. 2022, approximately 494.5 million infected cases and over 6.17 million
deaths were recorded all over the world. To reduce the loss of life and mitigate the impacts of COVID-19 on
many aspects of a country or region, policy makers must be provided with the most accurately predicted
scenarios to propose appropriate moves, which is why predictive models for pandemic situations are
important. Regarding the pandemic prediction, it is essential to take into consideration the model's ability to
rapidly adapt to pandemic changes and reflect them in the predicted results. Therefore, it is a promising
approach to utilize the SIRD model for pandemic description as well as the DL approach to process the historical
data for the key parameters (β, γ, μ). This helps the predictive model to perform well and maintain itself for
the long-term prediction task.

We compare the accuracy of the results obtained from the proposed combined SIRD-LSTM model
and from other popular forecasting models (both statistical and machine learning models) in Table 1, i.e.,
the least absolute shrinkage and selection operator (LASSO) [42], support vector machine for regression (SVR) [43],
and decision tree regression (DTR) [44]. We use the same n-day lag for all the models. It can be observed that,
similar to other popular predictive methods, the SIRD model can predict with high accuracy only when it is fed
with a sufficient amount of input data. The predicted results are of relatively high accuracy in comparison
with the other methods.


Table 1. Comparison of the forecasting models
Rate          Total infectious       Recovered              Deceased
Method        R²         MAPE        R²         MAPE        R²         MAPE
SIRD-LSTM     0.9979     0.0051      0.9878     0.0187      0.9890     0.0068
LASSO [42]    0.9991     0.0011      0.9666     0.0115      0.5365     0.0402
SVR [43]      -18.1162   0.6208      -13.4640   0.0187      -46.3249   0.5586
DTR [44]      -2.8516    0.2324      -3.0777    0.2922      -2.8120    0.1325
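
The shared n-day lag used for all models in this comparison can be pictured as a sliding window over the daily series: each training sample consists of n consecutive days as input and the following day as target. The sketch below shows one possible construction in plain Python; the function name `make_lagged_dataset` is hypothetical and the fitting of the individual regressors is not shown.

```python
def make_lagged_dataset(series, lag):
    """Build (inputs, targets) pairs from a daily series with an n-day lag."""
    inputs, targets = [], []
    for end in range(lag, len(series)):
        inputs.append(series[end - lag:end])  # the previous `lag` days
        targets.append(series[end])           # the day to be predicted
    return inputs, targets
```

A series of length m yields m − n samples, so a larger lag leaves fewer samples for training, which is one reason why a sufficient amount of input data matters for every method compared.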


4. CONCLUSION 

In conclusion, due to the complex nature of the pandemic, it is not sufficient for a predictive model
to provide only the closest prediction to reality. Indeed, the prediction must be interpretable so that human
experts can rely on it to make appropriate decisions. In view of this, we investigated the combined
SIRD-LSTM model, in which the SIRD model describes the pandemic and the DL approach identifies the key
parameters (β, γ, μ) that best fit the SIRD model to reality. Implementation of other semi-supervised learning
models could make the training process more efficient. Besides reviewing the COVID-19 situation in Vietnam and
conducting a predictive case study in Ho Chi Minh City, we also explain how to build the predictive model
and interpret the results. This study can be applied to other regions or to the prediction of other diseases as long as
the historical data are sufficiently available, thanks to the adaptability that DL offers. This paper would serve
as a starting point for further in-depth research into the many questions related to this pandemic.